Chris Bail
Computational Sociology
Duke University
Last class we noted a number of problems that plague analysis of large, complex datasets:
-What do p-values mean in the age of “big data”?
-How should we handle non-linear relationships between predictors and outcomes?
-What about causal complexity and rare events?
Today we are going to talk about some exciting new (or relatively new) tools that help address some of these problems from the interdisciplinary field of “machine learning” (a.k.a. statistical learning), which grew out of statistics and computer science/engineering.
We are not likely to see these models in ASR or AJS anytime soon because linear models are so deeply ingrained within the discipline. But I think these new methods will eventually make their way into our journals, and in the meantime they serve a very important purpose: improving understanding of our datasets before we analyze them using linear models.
Most people use machine learning to make predictions: for example, data scientists use machine learning to predict which types of advertising campaigns will work, and biostatisticians use them to predict who will live and die.
We are going to use them in a slightly different way: we are going to employ machine learning techniques to better classify our data and identify complex relationships between predictors and outcomes. Once again, I will argue that we can use these methods to build better linear models, while we wait for sociology to catch up with the state of the art.
We could easily spend the entire semester discussing machine learning. If you are curious about this topic, and especially if you want to examine the issue of prediction in greater detail, I recommend you check out this excellent online video course produced by two people who practically invented the field (at least within statistics):
http://www.r-bloggers.com/in-depth-introduction-to-machine-learning-in-15-hours-of-expert-videos/
Also, if you want to review what we do today, check out this amazing visual tutorial about machine learning:
Instead of going in depth into each technique, I am going to do what I always do in this class: try to give you enough of an overview that you could pursue this subject in more detail on your own. As always, we will work through messy real-world examples that I hope will help you see the strengths and weaknesses of these new models.
1) Generalized Additive Models
2) Regression/Classification Trees
3) Random Forests
Bonus:
Guest lecture by Alex Hanna on supervised topic models!
One huge problem with many linear models is that they are parametric: they assume, in advance, a fixed functional form (usually a straight line) for the relationship between each predictor and the outcome of interest. By contrast, GAMs are non-parametric, meaning that the shape of each predictor's function is determined by the data.
We can of course perform non-linear transformations of variables in order to account for this problem, but in order to do this, we need to know that non-linearity is a problem in the first place.
Even if we are careful and examine the linearity of the relationship between each predictor and the outcome in detail, such transformations are difficult to interpret: do you know what it means if a polynomial term is positive or negative in a regression model? What if there are multiple non-linear relationships within a model, as is often the case in social science data? To make matters worse, transforming multiple predictors within the same model can create collinearity issues.
Even if we can get around all of these problems, it would take us a very long time to do so…
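To see the manual approach GAMs spare us from, here is a toy sketch (my own, not from the lecture) using base R's poly() on the built-in mtcars data. The quadratic term captures a curved relationship, but its coefficients are hard to read substantively:

```r
# Manual non-linear specification: guess a polynomial transformation
# for one predictor and hope it fits. mtcars is built into R.
fit_poly <- lm(mpg ~ poly(hp, 2) + wt, data = mtcars)
summary(fit_poly)  # the poly(hp, 2) coefficients resist direct interpretation
```

Repeating this guess-and-check process for every predictor is exactly the tedium described above.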
Generalized Additive Models are a unique approach developed in 1986 by the statisticians Trevor Hastie and Robert Tibshirani. GAMs look a lot like traditional regression models, but they are much “smarter”: GAMs automatically select the best transformation of each variable FOR YOU, and these transformations can be either non-linear or linear.
After “smoothing” or transforming each variable (where necessary), GAMs simultaneously estimate the relationship between each predictor and the outcome, outputting results that are read much like those of a regular linear model.
This not only gets us around the issue of having to try out different linear combinations of variables we have hunches about but also helps us account for potential non-linearities that we did not even know to look for.
GAMs can also support any type of link function, or in other words, any type of dependent variable that might be put into a generalized linear model (e.g. binary, continuous, count, etc.). They also have spiffy ways of dealing with missing data, without multiple imputation.
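As a sketch of the link-function point, here is a toy example of my own (not from the lecture) fitting a logistic GAM with mgcv on simulated data, where the true effect of the predictor is non-linear on the logit scale:

```r
library(mgcv)

set.seed(1)
# Simulate a binary outcome whose log-odds depend non-linearly on x
x <- runif(200)
y <- rbinom(200, 1, plogis(sin(2 * pi * x)))

# family = binomial gives a logit link; s() lets gam() choose the smoother
logit_gam <- gam(y ~ s(x), family = binomial)
summary(logit_gam)  # the smooth term is estimated on the logit scale
```

Swapping in family = poisson, family = gaussian, and so on works the same way, which is what makes GAMs a drop-in generalization of the GLM framework.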
(Illustration by Kim Larson)
GAMs are a kind of compromise between conventional linear models, which are easy to interpret but often badly biased when relationships are non-linear, and state-of-the-art machine learning techniques such as random forests, which are very good at representing/classifying relationships between variables but very difficult to interpret.
I think we will probably see GAMs in AJS and ASR before random forests. Note, however, that GAMs can be converted back into linear models, so this is what we may see in the near term.
Let's look at an example that was put together by Michael Clark at Notre Dame. These are science test scores from the PISA cross-national education data:
data <- read.csv("http://www.nd.edu/~mclark19/learn/data/pisasci2006.csv")
The outcome we are going to examine is the “Overall Science Score,” averaged across students, for each country in the dataset.
Let's take a look at the data:
library(car)
scatterplotMatrix(data)
First, let's run a simple linear model (a gam() call with no smoothers is just a linear model):
library(mgcv)
first_model <- gam(Overall ~ Income + Edu + Health, data = data)
summary(first_model)
Family: gaussian
Link function: identity
Formula:
Overall ~ Income + Edu + Health
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 121.18 78.97 1.535 0.1314
Income 182.32 85.27 2.138 0.0376 *
Edu 234.11 54.78 4.274 9.06e-05 ***
Health 27.01 134.90 0.200 0.8421
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.616 Deviance explained = 63.9%
GCV = 1212.3 Scale est. = 1119 n = 52
Now let's try a model that applies a smoother function to each predictor:
second_model <- gam(Overall ~ s(Income) + s(Edu) + s(Health), data = data)
The s() here tells the gam function that we want to apply a smoother to each variable in the model, i.e. that we want R to choose the appropriate transformation for each predictor.
The mgcv package will also create plots that describe the relationship between each predictor and the outcome!:
plot(second_model, pages=1, residuals=T, pch=19, cex=0.25,
scheme=1, col='#FF8000', shade=T,shade.col='gray90')
This “contour plot” lets us look at multiple variables at once:
vis.gam(second_model, type = "response", plot.type = "contour")
We can also do something called “tensor product smoothing,” which you can think of as a single smoother built over multiple variables at once. It lets us create a three-dimensional plot of the non-linear relationships between two predictors and the outcome.
third_model <- gam(Overall ~ te(Income, Edu), data = data)
vis.gam(third_model, type='response', plot.type='persp',
phi=30, theta=30,n.grid=500, border=NA)
Once again, I'm not sure why GAMs don't have more influence in sociology. At the very least, I encourage you to use them to triangulate your work with linear models.
Illustration from James et al., “An Introduction to Statistical Learning,” p. 308
Regression trees first separate the data into “regions” using something called recursive binary partitioning. Essentially this means that the data are grouped into subsets of observations that have similar values on a predictor.
Next, regression tree algorithms simply take the mean value of all observations within each “region” or “subset” and use this to make a prediction about an outcome.
A brief example will help me illustrate this more clearly. We are going to create a regression tree using the tree function within the tree package and the built-in mtcars dataset:
install.packages("tree")
library(tree)
tree.cars <- tree(mpg ~ ., data = mtcars)
The mpg ~ . indicates we want to treat mpg as the outcome but look at every other variable in the dataset as a possible predictor.
One of the great things about regression trees is that they are very easy to interpret- even for those who have no idea what regression is!
Regression tree visualizations resemble a kind of “flowchart” within an organization; indeed, they are often called “decision trees” for this reason. The main difference is that regression trees describe how different predictors combine to produce an outcome (in this case, miles per gallon of cars), instead of hierarchy within a group.
The first line below plots the regression tree and the second line adds labels to the plot:
plot(tree.cars)
text(tree.cars ,pretty =0)
The way to read these trees is to focus on the '<' sign: observations that satisfy the condition (less than) follow the left branch, and the rest follow the right branch. The “wt” variable here describes weight (in thousands of pounds), so the tree shows us that this is a big factor in mpg, which is not surprising. In fact, cars weighing less than 2.26 (i.e., 2,260 pounds) average about 30 mpg. Heavier cars split into two groups by number of cylinders, and among the eight-cylinder cars, those with the highest horsepower (hp) also have the lowest mpg.
Finding the best fit for the data usually requires “pruning” the tree, or removing branches that add more complexity but little explanatory power. To do this we use the prune.tree command (cv.tree can be used first to choose the best tree size via cross-validation):
prune.cars <- prune.tree(tree.cars, best = 5)
plot(prune.cars)
text(prune.cars, pretty = 0)
In this case, our tree was already so simple that pruning it did not add parsimony.
Let's run through a quick classification tree example. The difference between a regression and a classification tree is simply that the former is for continuous outcomes and the latter is for binary or categorical outcomes.
Instead of selecting the average value of the outcome within each subset of the dataset, a classification tree selects the most common value.
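A quick sketch of my own (not from the lecture) makes the majority-vote idea concrete, using the built-in iris data, where the outcome is a three-category species label:

```r
library(tree)

# Classification tree: the outcome (Species) is categorical,
# so each terminal leaf predicts the most common class in that region
tree.iris <- tree(Species ~ ., data = iris)
summary(tree.iris)

plot(tree.iris)
text(tree.iris, pretty = 0)
```

The syntax is identical to the regression case; tree() notices that the outcome is a factor and switches to classification automatically.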
Let's use the Pew data from a previous class in order to identify patterns of support for the so-called “ground zero mosque” (pew10) according to education, sex, age, income, and party identification:
load("Pew Data.Rdata")
tree.groundzero <- tree(pew10 ~ educ + sex + age + inc + partyln, pewdata)
plot(tree.groundzero)
text(tree.groundzero, pretty = 0)
Another neat package for creating trees is the party package. In addition to the same tree structure generated by the packages above, this package adds plots of the cases within each branch of the tree. It also uses a slightly different technique for partitioning regions in the data. Let's take a look:
install.packages("party")
library(party)
prettycartree <- ctree(mpg~., data=mtcars)
plot(prettycartree)
The main downside of regression trees is that they are not the most accurate tool available. In fact, if all relationships between predictors and outcomes are linear, then a linear model will be more efficient.
The principal reason that regression trees are not the most efficient way of making predictions is that they rely on a single partition of the data. A single tree therefore has high variance: small changes in the data, especially in noisy regions, can produce very different trees and predictions.
But what if instead of partitioning the data only once, we did it hundreds or even thousands of times in slightly different ways and then averaged the results? This is what is known as bootstrapping, and in the language of regression trees, it is often described as “bagging.”
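Bagging can be sketched with the randomForest package by forcing every split to consider all of the predictors (mtry = p), which is the setting that distinguishes bagging from a random forest. This toy example on the built-in mtcars data is my own, not from the lecture:

```r
library(randomForest)

set.seed(123)
p <- ncol(mtcars) - 1  # number of predictors (everything except mpg)

# mtry = p means every predictor is a candidate at every split: bagging
bag.cars <- randomForest(mpg ~ ., data = mtcars, mtry = p, ntree = 500)
bag.cars  # prints the out-of-bag estimate of mean squared error
```

The printed out-of-bag error is itself a by-product of the bootstrap: each tree is tested on the observations left out of its bootstrap sample.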
Random forests are an extension of bagging that includes a small tweak that “decorrelates” the bagged trees: at each split, the algorithm considers only a random sample of the predictor variables as candidates, and chooses the best split from among them. This random selection keeps one dominant predictor from sitting atop every tree, giving other predictors with more moderate yet still meaningful associations with the outcome a chance to contribute.
Unfortunately, when we extract a large number of subsets from the data using bagging or random forests, we sacrifice the interpretability of a single regression tree. That is, we cannot construct a tree that visualizes all the bagged/randomly partitioned datasets.
On the other hand, we can still measure how important each predictor is, i.e. how important that “branch” is across the trees, by examining how much each variable reduces the residual sum of squares when it is used for a split.
Let's try it out:
install.packages("randomForest")
library(randomForest)
set.seed(123) # for reproducibility
rf.cars <- randomForest(mpg ~ ., data = mtcars, importance = TRUE)
To create the variable importance plot, we use this line:
varImpPlot(rf.cars)
The first plot shows us how much prediction error (mean squared error) increases when each variable's information is scrambled, i.e. how much accuracy the model loses without that variable.
The plot on the right describes the total decrease in “node impurity” (for regression, the residual sum of squares) attributable to splits on each variable, averaged across all trees.
Perhaps the best way to figure out how solid your predictions are is to use a training dataset, or subset of the data, to create the model, and then see how well it predicts the outcomes in the held-out data. Unfortunately, I think this type of analysis will be too foreign to many reviewers in sociology, so I would not recommend presenting it as your main analysis. Once again, however, I do think it may be useful for enhancing your understanding of complex patterns within the dataset.
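A minimal train/test sketch of my own (not from the lecture), again on the built-in mtcars data, looks like this:

```r
library(randomForest)

set.seed(123)
# Hold out half of the observations as a test set
train <- sample(nrow(mtcars), nrow(mtcars) / 2)

# Fit on the training half only
rf.train <- randomForest(mpg ~ ., data = mtcars[train, ])

# Predict the held-out half and compute test-set mean squared error
preds <- predict(rf.train, newdata = mtcars[-train, ])
test_mse <- mean((preds - mtcars$mpg[-train])^2)
test_mse
```

With only 32 cars this split is for illustration; in practice you would want a larger sample, or cross-validation, before trusting the error estimate.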
The most recent advance in regression trees is the concept of boosting, which is similar to bagging but grows trees sequentially, with each new tree fit to the residuals of the model so far. If you want to learn more about these models, check out the book “An Introduction to Statistical Learning” and the gbm package.
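As a hedged sketch of what a boosted model looks like with gbm (my own toy example on the built-in mtcars data; the tuning values here are illustrative, not recommendations):

```r
library(gbm)

set.seed(123)
# Boosted regression trees: each small tree is fit to the residuals
# of the ensemble built so far, shrunk by the learning rate
boost.cars <- gbm(mpg ~ ., data = mtcars,
                  distribution = "gaussian",  # continuous outcome
                  n.trees = 1000,             # number of sequential trees
                  interaction.depth = 2,      # depth of each tree
                  shrinkage = 0.01)           # learning rate

summary(boost.cars)  # relative influence of each predictor
```

The shrinkage parameter is what makes boosting sequential rather than averaged: each tree contributes only a small correction, so the trees must be grown one after another.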
One of the most exciting parts of computational social science is our (relatively) newfound ability to simulate virtual societies. Such simulations can help us explore links between the micro and the macro, and emergent phenomena more broadly, and help us get a better sense of when and where our data and models are wrong. Agent-based models can also help us develop new theories to test on data from the real world. More broadly, agent-based models are another way of making sense of the highly complex datasets that you might collect as part of this class.